69 research outputs found
MCP: Self-supervised Pre-training for Personalized Chatbots with Multi-level Contrastive Sampling
Personalized chatbots aim to endow a chatbot with a consistent personality so
that it behaves like a real user and can further act as a personal assistant.
Previous studies have explored generating implicit user profiles from the
user's dialogue history for building personalized chatbots. However, these
studies train the entire model with only the response generation loss, which
makes them prone to data sparsity. Besides, they overemphasize the quality of
the final generated response while ignoring the correlations among the
utterances in the user's dialogue history, leading to coarse data
representations and degraded performance. To tackle these problems, we
propose a self-supervised learning framework MCP for capturing better
representations from users' dialogue history for personalized chatbots.
Specifically, we apply contrastive sampling methods to exploit the supervision
signals hidden in the user's dialogue history and to generate pre-training samples
for enhancing the model. We design three pre-training tasks based on three
types of contrastive pairs from user dialogue history, namely response pairs,
sequence augmentation pairs, and user pairs. We pre-train the utterance encoder
and the history encoder toward the contrastive objectives and use these
pre-trained encoders to generate user profiles during personalized response
generation. Experimental results on two real-world datasets show that our
proposed model MCP significantly outperforms existing methods.
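To make the contrastive objectives concrete, the sketch below shows an
InfoNCE-style loss over a batch of paired embeddings, a standard way to train
encoders on contrastive pairs such as the response, sequence augmentation, and
user pairs described above. The encoder interface, batch construction, and
temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss over a batch of contrastive pairs.

    anchor, positive: (batch, dim) embeddings of the two views of each
    pair, e.g. two responses, an original and an augmented history
    sequence, or two histories from the same user.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                      # pairwise similarities
    labels = torch.arange(a.size(0), device=a.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```

Each of the three pre-training tasks would feed its own pair type through the
utterance or history encoder and minimize a loss of this form before the
response generation fine-tuning stage.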
JDsearch: A Personalized Product Search Dataset with Real Queries and Full Interactions
Recently, personalized product search has attracted great attention, and many
models have been proposed. To evaluate the effectiveness of these models,
previous studies mainly utilize the simulated Amazon recommendation dataset,
which contains automatically generated queries and excludes cold users and tail
products. We argue that evaluating with such a dataset may yield unreliable
results and conclusions, and deviate from real user satisfaction. To overcome
these problems, in this paper, we release a personalized product search dataset
comprised of real user queries and diverse user-product interaction types
(clicking, adding to cart, following, and purchasing) collected from JD.com, a
popular Chinese online shopping platform. More specifically, we sample about
170,000 active users on a specific date, then record all the products they
interacted with and the queries they issued over one year, without removing any
tail users or products. This results in roughly 12,000,000 products, 9,400,000 real
searches, and 26,000,000 user-product interactions. We study the
characteristics of this dataset from various perspectives and evaluate
representative personalization models to verify its feasibility. The dataset
can be publicly accessed on GitHub: https://github.com/rucliujn/JDsearch.
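As a rough illustration of how such a dataset might be consumed, the snippet
below tallies interaction types from a hypothetical JSON-lines export. The file
name and field names (`interactions`, `type`) are invented for illustration;
the actual schema in the JDsearch repository should be consulted.

```python
import json
from collections import Counter

def interaction_stats(path: str) -> Counter:
    """Count user-product interaction types in a JSON-lines dump.

    Assumes one record per user with an 'interactions' list whose items
    carry a 'type' field (e.g. click, add_to_cart, follow, purchase);
    this layout is hypothetical, not the dataset's documented schema.
    """
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            user = json.loads(line)
            for event in user.get("interactions", []):
                counts[event["type"]] += 1
    return counts

print(interaction_stats("jdsearch_users.jsonl"))
```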
Retrieve Anything To Augment Large Language Models
Large language models (LLMs) face significant challenges stemming from their
inherent limitations in knowledge, memory, alignment, and action. These
challenges cannot be addressed by LLMs alone; they call for assistance from
the external world, such as knowledge bases, memory stores, demonstration
examples, and tools. Retrieval augmentation is a vital mechanism for bridging
the gap between LLMs and this external assistance. However,
conventional methods encounter two pressing issues. On the one hand,
general-purpose retrievers are not properly optimized for the retrieval
augmentation of LLMs. On the other hand, task-specific retrievers lack the
versatility needed to perform well across diverse retrieval augmentation
scenarios.
In this work, we present a novel approach, the LLM-Embedder, which
comprehensively supports the diverse retrieval augmentation needs of LLMs with
one unified embedding model. Training such a unified model is non-trivial, as
various retrieval tasks aim to capture distinct semantic relationships, often
subject to mutual interference. To address this challenge, we systematically
optimize our training methodology. This includes reward formulation based on
LLMs' feedback, the stabilization of knowledge distillation, multi-task
fine-tuning with explicit instructions, and homogeneous in-batch negative
sampling. These optimization strategies contribute to the outstanding empirical
performance of the LLM-Embedder. Notably, it yields remarkable enhancements in
retrieval augmentation for LLMs, surpassing both general-purpose and
task-specific retrievers in various evaluation scenarios. Our checkpoint and
source code are publicly available at
https://github.com/FlagOpen/FlagEmbedding
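The homogeneous in-batch negative sampling mentioned above can be sketched as a
contrastive loss that masks out cross-task pairs, so examples from different
retrieval tasks never serve as each other's negatives. Everything in this
snippet (tensor shapes, temperature, masking scheme) is an illustrative
reconstruction, not the authors' released training code.

```python
import torch
import torch.nn.functional as F

def homogeneous_in_batch_loss(query_emb: torch.Tensor,
                              key_emb: torch.Tensor,
                              task_ids: torch.Tensor,
                              temperature: float = 0.02) -> torch.Tensor:
    """Contrastive loss with homogeneous in-batch negatives.

    query_emb, key_emb: (batch, dim) outputs of one shared embedder;
    task_ids: (batch,) integer task labels. Only same-task examples act
    as negatives, reducing interference between retrieval tasks.
    """
    q = F.normalize(query_emb, dim=-1)
    k = F.normalize(key_emb, dim=-1)
    logits = q @ k.T / temperature
    same_task = task_ids.unsqueeze(0) == task_ids.unsqueeze(1)
    logits = logits.masked_fill(~same_task, float("-inf"))  # drop cross-task negatives
    labels = torch.arange(q.size(0), device=q.device)       # i-th key matches i-th query
    return F.cross_entropy(logits, labels)
```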
Halothiobacillus neapolitanus Carboxysomes Sequester Heterologous and Chimeric RubisCO Species
Background: The carboxysome is a bacterial microcompartment that consists of a
polyhedral protein shell filled with ribulose 1,5-bisphosphate
carboxylase/oxygenase (RubisCO), the enzyme that catalyzes the first step of
CO2 fixation via the Calvin-Benson-Bassham cycle.
Methodology/Principal Findings: To analyze the role of RubisCO in carboxysome
biogenesis in vivo, we created a series of Halothiobacillus neapolitanus
RubisCO mutants. We identified the large subunit of the enzyme as an important
determinant for its sequestration into alpha-carboxysomes and found that the
carboxysomes of H. neapolitanus readily incorporate chimeric and heterologous
RubisCO species. Intriguingly, a mutant lacking carboxysomal RubisCO assembles
empty carboxysome shells of apparently normal shape and composition.
Conclusions/Significance: These results indicate that carboxysome shell
architecture is not determined by the enzyme the shells normally sequester. Our
study provides, for the first time, clear evidence that carboxysome contents
can be manipulated, and it suggests future nanotechnological applications based
upon engineered protein microcompartments.
RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit
Although Large Language Models (LLMs) have demonstrated extraordinary
capabilities in many domains, they still have a tendency to hallucinate and
generate fictitious responses to user requests. This problem can be alleviated
by augmenting LLMs with information retrieval (IR) systems (also known as
retrieval-augmented LLMs). With this strategy, LLMs can generate more factual
responses to user input by using, as references, the relevant content that IR
systems retrieve from external corpora. In addition, by
incorporating external knowledge, retrieval-augmented LLMs can answer in-domain
questions that cannot be answered by solely relying on the world knowledge
stored in parameters. To support research in this area and facilitate the
development of retrieval-augmented LLM systems, we develop RETA-LLM, a
RETrieval-Augmented LLM toolkit. In RETA-LLM, we create a complete pipeline
to help researchers and users build their customized in-domain LLM-based
systems. Compared with previous retrieval-augmented LLM systems, RETA-LLM
provides more plug-and-play modules to support better interaction between IR
systems and LLMs, including request rewriting, document retrieval, passage
extraction, answer generation, and fact checking modules. Our toolkit is
publicly available at https://github.com/RUC-GSAI/YuLan-IR/tree/main/RETA-LLM.
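As a loose sketch of how the five modules named above could compose into one
pipeline, the toy functions below stand in for the toolkit's stages; none of
this mirrors RETA-LLM's actual API, and each body is a trivial placeholder
where the real module (or an LLM call) would go.

```python
def rewrite_request(history: list[str], request: str) -> str:
    # Request rewriting: fold recent conversational context into the query.
    return " ".join(history[-1:] + [request]) if history else request

def retrieve_documents(query: str, corpus: list[str]) -> list[str]:
    # Document retrieval: naive keyword overlap stands in for a real IR system.
    terms = set(query.lower().split())
    return [doc for doc in corpus if terms & set(doc.lower().split())]

def extract_passages(docs: list[str], limit: int = 3) -> list[str]:
    # Passage extraction: keep only the top few candidate documents.
    return docs[:limit]

def generate_answer(query: str, passages: list[str]) -> str:
    # Answer generation: an LLM call conditioned on the passages would go here.
    return f"Answer to {query!r} grounded in {len(passages)} passage(s)."

def check_facts(answer: str, passages: list[str]) -> str:
    # Fact checking: verify the generated answer against retrieved evidence.
    return answer if passages else "No supporting evidence found."

def answer_request(history: list[str], request: str, corpus: list[str]) -> str:
    query = rewrite_request(history, request)
    passages = extract_passages(retrieve_documents(query, corpus))
    return check_facts(generate_answer(query, passages), passages)
```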
Characterization of the chloroquine resistance transporter homologue in Toxoplasma gondii
Mutations in the Plasmodium falciparum chloroquine resistance transporter
(PfCRT) protein confer resistance to the antimalarial drug chloroquine. PfCRT
localizes to the parasite digestive vacuole, the site of chloroquine action,
where it mediates resistance by transporting chloroquine out of the digestive
vacuole. PfCRT belongs to a family of transporter proteins called the
chloroquine resistance transporter family. CRT family proteins are found
throughout the Apicomplexa, in some protists, and in plants. Despite the
importance of PfCRT in drug resistance, little is known about the evolution or
native function of CRT proteins. The apicomplexan parasite Toxoplasma gondii
contains one CRT family protein. We demonstrate that T. gondii CRT (TgCRT)
colocalizes with markers for the vacuolar (VAC) compartment in these parasites.
The TgCRT-containing VAC is a highly dynamic organelle, changing its morphology
and protein composition between intracellular and extracellular forms of the
parasite. Regulated knockdown of TgCRT expression resulted in a modest
reduction in parasite fitness and swelling of the VAC, indicating that TgCRT
contributes to parasite growth and VAC physiology. Together, our findings
provide new information on the role of CRT family proteins in apicomplexan
parasites.
- …